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DATA QUALITY AND INTEGRITY ENGINE 



FIELD OF THE INVENTION 

The present invention relates generally to database systems, and in 
particular to data warehousing techniques. 

BACKGROUND 

All types of organisations, business entities, and persons may own legacy 
database systems that have been acquired at different times. A business may rely 
upon a particular database or transaction system to handle data aggregation and 
processing functions for a part of a business. Because of investment, knowledge, and 
experience with that system, an organisation or entity may choose not to replace such 
a system, for example simply to avail itself of the stability of the system. Later, 
another database or transaction system may be acquired and used to handle a different 
aspect of the business. In this manner, an entity may ultimately operate a number of 
database systems that do not interact well or at all with one another. In a similar 
manner, an entity may have its own database or transaction system and need to 
interact with a number of different database or transaction systems of other entities. 
For example, a number of entities may be working collaboratively on a project, but 
each have their own database or transaction systems. 

One approach to resolving this problem is to mandate the use of a standard 
database system throughout the entity or entities. However, this may not be possible 
or desirable for a number of reasons. For example, an entity working collaboratively 
with others for a short-term project may consider this to be too onerous of a 
requirement and therefore unjustifiable. 

Data warehouses have attempted to address this problem to collect data 
from various sources, but suffer from a number of disadvantages. Fig. 1 illustrates a 
system 100 in which a data warehouse 102 receives data from a number of databases 
1 10-122, which is used to produce deliverable data 130. However, such a data 

i 
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warehouse 102 produces mismatches in the data 130. This results from errors in the 
data itself (e.g. due to data entry problems), synchronization problems (e.g., a 
database may not yet have been amended), and conceptual differences. Relevant 
conceptual differences include like fields not having the same name, unlike fields 
5 having the same name, like fields having different definitions and/or formats, and like 
entities having different attributes, to name a few. 

There has been little synergy between various databases in such 
circumstances, and users may need to learn a number of different application to find 
1 0 information the users need from such disparate databases. 

Thus a need clearly exists for an improved method of ensuring quality and 
integrity of data from a data source. 

15 SUMMARY 

In accordance with a first aspect of the invention, a method of ensuring 
data quality and integrity of a data set derived from a data source is provided. The 
method includes the steps of: obtaining data from the data source; and building a data 
repository using the data from the data source. The data repository includes a data 

20 structure that forms a model of the data from the data source. The building step 

includes the steps of: applying business rules from a rules database to the data from 
the data .source, where the business rules dependent upon meta data; and detecting 
any errors in the data and storing data satisfying the business rules in the data 
repository. 

25 

A log of any detected errors may be maintained in the data repository. 
Preferably, the detected errors are reported for correction of the errors in the data 
source. Optionally, an integrated data set can be provided for export from the data 
repository. 

30 
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Optionally, the data source includes a plurality of database systems and/or 
transaction systems. The method may include the step of storing the data from the 
plurality of systems in a staging area. More preferably, the model is an enterprise- 
level model and the business rules are enterprise level business rules 

The method may include the step of feeding back the errors to the data 
source for correction. Further, at least a portion of data of the data source is corrected 
dependent upon an error fed back to the data source. 

Preferably, the applying step includes the step of invoking procedures 
stored in the data repository. The meta data may be stored in the data repository. 
Optionally, the data from the data source is loaded into a staging area. Further, the 
method includes the step of triggering the building step. The rules database includes 
one or more attributes for each rule selected from the group consisting of: rule type, 
rule name, a text description of the rule, rule syntax, invocation of the rule, reporting 
of erroneous data to the enterprise-level model, name of a stored procedure for 
checking the rule, rule precedence, a target table identifier, a target column name , 
activation status of the rule, status information of whether or not the rule is required 
for complete data quality and integrity, an error identifier, status information of 
whether or not the rule is traceable back to the data from the transaction systems, and 
a parameter list, if required by the stored procedure. 

Preferably, each rule of the rules database includes a SQL statement. 

In accordance with further aspects of the invention, a system and a 
computer program product for ensuring data quality and integrity of a data set derived 
from a data source are provided that implement the method of the foregoing aspect. 

BRIEF DESCRIPTION OF THE DRAWINGS 

A small number of embodiments of the invention are described hereinafter with 
reference to the drawings, in which: 
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Fig. 1 is a block diagram of a system using a data warehouse to provide 
deliverable data; 

Fig. 2 is a block diagram of a data quality and integrity engine for data 
from a plurality of different database or transaction systems in accordance with an 
embodiment of the invention; 

Fig. 3 is a flow diagram of a representative process for loading data into a 
data repository that can be used in the system of Fig. 2; and 

Fig. 4 is a flow diagram illustrating the process of the data quality and 
integrity engine of Fig. 2. 

DETAILED DESCRIPTION 

A method, a system, and a computer program product for ensuring data 
quality and integrity of a data set derived from a data source are described. Numerous 
specific details are set forth in the following description including particular mobile 
phone networks such as data interchange formats, database systems, and the like. 
However, it will be apparent to those skilled in the art in the light of this disclosure 
that modifications and/or substitutions may be made without departing from the scope 
and spirit of the invention. In other instances, well-known details may be omitted so 
as not to obscure the invention. 

Some portions of the description that follows are explicitly or implicitly 
presented in terms of algorithms and representations of operations on data within a 
computer system or other device capable of performing computations. These 
algorithmic descriptions and representations are the means used by those skilled in the 
data processing arts to most effectively convey the substance of their work to others 
skilled in the art. An algorithm is here, and generally, conceived to be a self- 
consisteat sequence of steps leading to a desired result. The steps are those requiring 
physical manipulations of physical quantities. Usually, though not necessarily, these 
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quantities take the form of electrical, or magnetic, or electromagnetic signals capable 
of being stored, transferred, combined, compared, and otherwise manipulated. It has 
proven convenient at times, principally for reasons of common usage, to refer to these 
signals as bits, values, elements, symbols, characters, terms, numbers, or the like. 

5 

It should be borne in mind, however, that the above and similar terms are to 
be associ ated with the appropriate physical quantities and are merely convenient 
labels applied to these quantities. Unless specifically stated otherwise, and as 
apparent from the following, it will be appreciated that throughout the present 

10 specification, discussions utilizing terms such as "executing", "loading", "sending", 
"receiving", "fetching", "storing" "waiting", "reporting", or the like, refer to the 
action and processes of a computer system, or similar electronic device, that 
manipulates and transforms data represented as physical (electronic) quantities within 
the registers and memories of the computer system into other data similarly 

15 represented as physical quantities within the computer system memories or registers 
or other such information storage, transmission or display devices. 

The present specification also discloses an apparatus or system for 
performing the operations of the methods. Such an apparatus may be specially 

20 construct ed for the required purposes, or may include a general-purpose computer or 
other device selectively activated or reconfigured by a computer program stored in the 
computer. The algorithms and displays presented herein are not inherently related to 
any particular computer or other apparatus. Various general-purpose machines may 
be used with programs in accordance with the teachings herein. Alternatively, the 

25 construct ion of more specialized apparatus to perform the required method steps may 
be appropriate. The structure of a conventional general-purpose computer appears 
from the description below. 

In addition, the present invention also implicitly discloses a computer 
30 program(s) or software, in that it would be apparent to the person skilled in the art that 
the individual steps of the preferred method described herein may be put into effect by 
computer code. The computer program is not intended to be limited to any particular 
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programming language and implementation thereof. It will be appreciated that a 
variety of programming languages and coding thereof may be used to implement the 
teachings of the disclosure contained herein. Moreover, the computer program is not 
intended to be limited to any particular control flow. There are many other variants of 
the computer program, which can use different control flows without departing the 
spirit or scope of the invention. Furthermore one or more of the steps of the computer 
program may be performed in parallel rather than sequentially. 

Such a computer program may be stored on any computer readable medium. 
The computer readable medium may include storage devices such as magnetic or 
optical disks, memory chips, or other storage devices suitable for interfacing with a 
general-purpose computer. The computer readable medium may also include a hard- 
wired medium such as exemplified in the Internet system, or wireless medium such as 
exemplified in the GSM mobile telephone system. The computer program when 
loaded and executed on such a general-purpose computer effectively results in an 
apparatus that implements the steps of the preferred method. 

The preferred method(s) comprise a particular control flow. There are 
many other variants of the preferred method(s),s which use different control flows 
without departing the spirit or scope of the invention. Furthermore one or more of the 
steps of the preferred method(s) may be performed in parallel rather sequential. 

Overview 

The embodiments of the invention provide a data quality and integrity 
engine (DQIE) that is able to enforce business rules on data from a data source. The 
data source may include one or more databases, warehouses, and transaction systems. 
This is achieved by downloading data from the data source satisfying the business 
rules into a data repository that includes a data structure that forms a Model of the 
data. Preferably, the Model is an enterprise model (EM). Errors are detected by the 
DQIE and automatically reported back to the Data Owner(s) of the data source, where 
the errors can be corrected at the source. 
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The DQIE can be used to integrate data into a single data set where the 
source data is derived from disparate Transaction Systems or databases. The DQIE 
enables business rules to be established, managed, and enforced. Preferably, the rules 
are enterprise level business rules. Further, data from disparate database systems can 
5 be delivered as an integrated data set. This reduces the costs of data management and 
business requirements. 

By creating an enterprise model, enterprise-level business rules can be 
easily est ablished and enforced on this enterprise model, rather than the source data. 

10 

System 2 00 

Fig. 2 is a block diagram illustrating an embodiment of the invention for 
ensuring data quality and integrity for data derived a data source. Here the data 
source is preferably, but optionally, several different transaction systems. The system 
15 200 of Fig. 2 includes transaction systems 210, a data warehouse 220, and a data 
quality and integrity engine 250 and an associated rules database 252 that provide a 
virtual quality firewall 240 for the data warehouse. 

The transaction systems 210 include a number of individual transaction 
20 systems 2 1 OA, 2 1 0B, . . . , 2 1 0C, which periodically load data 2 1 2 into the data 
warehouse 220. The individual transaction systems 21 OA, 21 0B, . . ., 2 10C may 
poorly interact with one or more of the other transaction systems, or not at all, making 
the enforcement of enterprise-level business rules across the transaction systems 
2 1 OA, 2 1 0B, ...,21 0C, impossible or impracticable. 

25 

A staging area 242 receives the data 212 periodically loaded from the 
transaction systems 210. Rule by rule and row by row, data in the staging area 242 is 
accessed by the data quality and integrity engine (DQIE) 250. Individual data values 
are retrieved by the DQIE 250 from the staging areas 242 and checked for such things 
30 as range, format, uniqueness or relationship to other data values. The arrow 260 
generally indicates that data is sampled by the DQIE 250 to check values and 
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relationships. Little or no business rules are applied to the data 212 loaded into the 
staging area 242. The staging area 242 receives both good and bad data. Data 
transform rules are applied between the transaction systems and the staging area, 
which may produce an intermediate file. Data may be brought into the staging area 
5 242 in using variable character field text, for example. As is explained in greater 
detail hereinafter with regard to the DQIE 250, a virtual quality firewall 240 is 
maintained between the staging area 242 and the data warehouse or repository 220. 
The DQIE 250 populates the warehouse data 222 with data from the staging area 242, 
and thus controls the flow of data from the staging area 242 into the warehouse data 
10 222. 

The data warehouse 220 includes warehouse data 222, meta data 224, an 
error log 226, an error history 228, and stored procedures 230. The heart of the data 
warehouse is the relational store and this is where the Enterprise Model resides. Also 

1 5 business rules are checked and the data history is maintained. The meta data 224 

stores information about the structure and relationships within the database 222. For 
example, there is preferably a table called "Table Joins". This table contains table and 
Column Ids, together with the type of join and constraints, if any on the data range. 
By storing this information in a table, the DQIE 250 can automatically execute a 

20 single Store Procedure 230 on a number of different tables. For example, a single rule 
can check for orphan rows in a parent/child relationship between many tables. Other 
meta data includes Display Names to be used for Tables and Columns. Regarding the 
stored procedures 230, many modem database engines like Oracle and Microsoft SQL 
Server incorporate the ability to store executable procedures and triggers at the 

25 database level. Often stored procedures 230 are executed by triggers or other 

applications. The stored procedures 230 are the "teeth" of the DQIE 250 and are 
invoked by the DQIE 250. These procedures 230, together with parameter lists and 
SQL statements (both stored in the rules database 252) act together to check and 
enforce the business rules. All the procedural parts of the rules may be stored as SQL 

30 in the rules database 252, but are preferably and conveniently stored and run as the 
executable part of the rules as stored procedures 230. 
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The error log 226 provides input 21 8 to the error history 228. The data 
quality and integrity engine 250 is coupled to a rules database that contains the 
enterprise business rules. The rules database is separate from the data quality and 
integrity engine 250. The meta data 224 is coupled to the rules database 252. The 
DQIE 250 has access to the warehouse data 222. Further, the DQIE 250 provides 
error data 254 based on the warehouse data 222 to the error log 226 and can invoke 
the stored procedures 230. Good data produced using the DQIE 250 can be exported 
as integrated data set for data delivery. 

The periodic loading process of data 212 to the staging area 242 also 
triggers 214 the DQIE 250. Also, the DQIE 250 notifies the transaction systems 210 
when errors are discovered in the source data, so that the source data may be 
corrected. These and other aspects of the system 200 are described in further detail 
hereinafter. 

Rule Database 252 

The Rule Database 252 is a key aspect of the DQIE 250. The Rule 
Database 252 contains both data and meta-data that fully describe each Rule. The rule 
may be implemented using a SQL statement, for example. Importantly, the rules are 
not coded in the DQIE 250. That is, the rules are independent of the DQIE 250. This 
structure allows many of the rule attributes to be managed by system administrators, 
without the need for reprogramming. The data of this Rule Database 252 includes the 
following attributes: 

• Rule type, 

• The rule name, 

• A plain English description of the rule, 

• Rule syntax, 

• At what point in the process the rule should be invoked, 

• Whether or not errant data should progress to the Enterprise Model 220, 

• The name of the Stored Procedure 230 that checks the rule, 
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The rule precedence, 

The target Table ID, 

The target Column Name , 

Whether or not the rule is Active (On/Off), 

Whether or not the rule is necessary for complete Data quality and integrity, 
Error ID, 

Whether or not the rule is Traceable back to the Source Data, and 
A parameter list, if required by the Stored Procedure 230. 



10 Data History 

As data is downloaded into the warehouse data 222, the data is compared 
with previous records, based on their Primary Key values. The result of this 
comparison allows each record to be marked as an Add, Modify, or Delete. This also 
allows a data history to be maintained by storing the changes in history tables. The 

15 DQIE 250 also uses this Data History feature to ensure that the "Current" view of the 
data only includes "good" data. Preventing "bad" data from being included in the 
"current" view foims the virtual Quality Firewall 240. 

Quality Firewall 240 

20 Triggered by the data load and driven by the DQIE, data flows through 

various sets of tables in the data warehouse, from the Staging Area, through to the 
Enterprise Model (EM). Depending on the rule meta-data, failing to meet certain 
rules can prevent "bad" data from progressing through to the EM, thereby retaining 
past records as "current" data. The action of the DQIE to prevent "bad" data from 

25 reaching the EM, forms a virtual "Quality Firewall". The DQIE 250 is preferably 
implemented as software. Likewise the firewall 240 produced by the DQIE 250 can 
prevent bad data from moving out of the data warehouse 220. 

Error Storage 226 

30 Error data 254 is stored in a series of Error Tables 226 that mimic the 

table names, in which the errors occurred. These tables store meta-data about each 
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breach of every rule. These Error Tables include data such as the Primary Key value, 
the Rule ED, and in some instances the Column value, where the actual source value 
did not meet the column constraints. 

5 Therefore, errors can be traced down to the column- and row-levels and 

displayed to the user, even if the errant source data fails to meet the column 
definitions in the enterprise-level model. Time stamping each row in the error tables, 
allows the Error History 21 8 to be viewed either by Table or by Rule. 

10 Error Reporting 

Preferably, the DQIE 250 does not correct errors. Instead, errors are 
reported to the Data Owners of the source Transaction Systems 210. This reporting is 
preferably done by e-mail, but other mechanisms may be practiced without departing 
from the scope and spirit of the invention. Data Owners then view the errors using a 

15 User Interface (UI), but correct the errors in the source Transaction Systems 210. 

Loading Process 

Fig. 3 is a flow diagram illustrating the process 300 of loading data into 
the data warehouse 220 of Fig. 2. Processing commences in step 302. In decision 
20 step 304, a check is made to determine if a specified time and/or data has been 

reached (e.g., 1 AM on Monday). If step 304 returns false (NO), processing continues 
at step 306, in which a specified period of time (e.g. one hour) is allowed to elapse. 
Processing then continues at step 304. If step 304 returns true (YES), processing 
continues at step 308. 

25 

In step 308, a check is made for the presence of files to be downloaded 
from the transaction systems 210 of Fig. 2. Preferably, a script 310 creates the 
download files. This may done periodically (e.g. once a week), and the download 
files 312 produced are checked by step 308. In decision step 314, a check is made if 
30 all download files are available. If step 314 returns false (NO), a specified period of 
time (e.g., one hour) is allowed to elapse in step 316. From there, processing 
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continues at step 308. If step 314 returns true (YES), processing continues at step 
318. In step 318 the process of loading data commences. In step 320, a control loop 
is entered to process all files. Preferably, step 320 implements a for loop. Processing 
continues at decision step 322 for the current file. 

5 

In step 322, a check is made to determine the data meet or satisfy at least a 
subset of the business rules 252. If step 322 returns false (NO), processing continues 
at step 324 and an error log is created. Processing then continues at step 320 for the 
next file. Otherwise, if step 322 returns true (YES), processing continues at step 326. 
10 In step 326, the date for the current file is placed into the staging area 327 (242 of Fig. 
2). The next file is then checked at decision step 328, which checks to see if the next 
file is the last file to be processed in the for loop. If decision step 328 returns false 
(NO), processing continues at step 320. Otherwise, if step 328 returns true (YES), 
processing continues at step 330. 

15 

In step 330, loading into the relational store (222) commences. In step 
332, a control loop is entered to process all files. Preferably, step 332 implements a 
for loop. Processing continues at decision step 334 for the current file. In step 334, a 
check is made to determine if the data of the current file satisfies all relevant business 

20 rules of the rules database 252. If step 334 returns false (NO), processing continues in 
step 336. In step 336, an entry in the error log 226 is created for this file. Processing 
of the next file continues at step 332. Otherwise if step 334 returns true (YES), 
processing continues at step 338. In step 338, the data is moved into the relational 
store 340 (222) and history files 342. Processing then continues with the next file at 

25 step 344. 

In step 344, a check is made to determine if the last file has been reached. 
If step 344 returns false (NO), processing continues at step 332. Otherwise, if 
decision step 344 returns true (YES), processing continues at step 346. In step 346, 
30 completion of the data load is reported. The report is preferably sent via email to a 
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system administrator. In step 348, errors are reported in an error report 350 to the 
transaction systems 210. In step 352, processing terminates. 

Data Quality and Integrity Engine Process 
5 Fig. 4 is a flow diagram illustrating the processing 400 of the data quality 

and integrity engine 250 of Fig. 2. In step 402, processing commences. In step 404, a 
check is made to determine if the specified time for loading data has been reached. If 
step 404 returns false (NO), processing continues at step 406. In step 406, a specified 
or given period of time (e.g., one hour) is allowed to elapse. Processing then returns 

10 to set 404. If step 404 returns true (YES), processing continues at step 408. In step 
408 data is loaded in the manner of Fig. 3. A control loop (e.g. a do while or for 
loop) is started. In step 412, an enterprise-level business rule from the rule database 
416 (252 in Fig. 2) is executed using the stored procedures 414 (230 in Fig. 2) on the 
data. In step 418, meta data is fetched 420 (224 in Fig. 2), as required. In step 422, 

15 any resulting error data 424 is stored in the error history 426 (228 in Fig. 2). In step 
428, if the last rule has not been executed, processing continues at step 410. 
Otherwise processing terminates in step 430. 

The data quality and integrity engine (DQIE) thereby advantageously 
20 establishes, manages, and enforces Enterprise-Level business rules across a number of 
disparate Transaction Systems. Further, the DQIE detects errors in the data and 
reports this back to the Data Owners, so that the errors can be corrected at the source. 
The DQIE integrates data into a single data set where the source data is derived from 
disparate Transaction Systems or databases. Further the separate Rules Database 
25 associated with the DQIE allows easy maintenance of the enterprise-level business 
rules. 

The DQIE has the following advantages: 

- Because the rules are separate from the DQIE, the code within the 
30 DQIE can be "generic" and capable of executing any rule; 
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- By editing the meta data via a suitable user interface, rules can be 
managed by a non-programmer; 

- Rules can be easily added, deleted, or edite; and 

The rule meta data, including title and descriptive text, can be viewed 
5 by users. This allows users to relate particular breaches to the actual 

rule and to make comment where appropriate. 

Computer Implementation 

The method of ensuring data quality and integrity of a data set derived from a 

10 data repository is preferably practiced using one or more general-purpose computer 
systems and handheld devices, in which the processes of Figs. 1 and 2 may be 
implemented as software, such as an application program executing within the 
computer system or handheld device. In particular, the steps of method of ensuring 
data quality and integrity of a data set derived from a data repository are effected, at 

15 least in part, by instructions in the software that are carried out by the computer. The 
instructions may be formed as one or more code modules, each for performing one or 
more particular tasks. The software may be stored in a computer readable medium, 
including the storage devices described below, for example. The software is loaded 
into the computer from the computer readable medium, and then executed by the 

20 computer. A computer readable medium having such software or computer program 
recorded on it is a computer program product. 

Examples of computers on which the described arrangements can be 
practised include IBM-PC's and compatibles, Sun Sparcstations or alike computer 

25 systems. Still further, the software can also be loaded into the computer system from 
other computer readable media. The term "computer readable medium" as used 
herein refers to any storage or transmission medium that participates in providing 
instructions and/or data to the computer system for execution and/or processing. 
Examples of storage media include floppy disks, magnetic tape, CD-ROM, a hard 

30 disk drive, a ROM or integrated circuit, a magneto-optical disk, or a computer 

readable card such as a PCMCIA card and the like, whether or not such devices are 
internal or external of the computer module. Examples of transmission media include 
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radio or infra-red transmission channels as well as a network connection to another 
computer or networked device, and the Internet or Intranets including e-mail 
transmissions and information recorded on Websites and the like. 

5 A small number of embodiments of the invention regarding a method, a 

system, and a computer program product for ensuring data quality and integrity of a 
data set derived from a data source have been described. In the light of the foregoing, 
it will be apparent to those skilled in the art in the light of this disclosure that various 
modifications and/or substitutions may be made without departing from the scope and 
1 0 spirit of the invention. 
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The claims defining the invention are as follows: 

1 . A method of ensuring data quality and integrity of a data set 
derived from a data source, said method including the steps of: 

obtaining data from said data source; 

building a data repository using said data from said data source, said data 
repository including a data structure that forms a model of said data from said data 
source, said building step including the steps of: 

applying business rules from a rules database to said 
data from said data source, said business rules dependent 
upon meta data; and 

detecting any errors in said data and storing data 
satisfying said business rules in said data repository. 

2. The method according to claim 1; further including the step of 
reporting said detected errors for correction of said errors in said data source. 

3. The method according to claim 1 , further including the step of 
providing an integrated data set for export from said data repository. 

4. The method according to claim 1 , wherein said data source 
includes a plurality of transaction systems. 

5; The method according to claim 4, further including the step of 

storing said data from said plurality of transaction systems in a staging area. 

6. The method according to claim 1 or 5, wherein said model is 

an enterprise-level model and said business rules are enterprise level business rules 
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7. The method according to claim 1, further including the step of 
feeding back said errors to said data source for correction. 

8. The method according to claim 7, further including the step of 
correcting at least a portion of data of said data source dependent upon an error fed 
back to said data source. 

9. The method according to claim 1, wherein said applying step 
includes Ihe step of invoking procedures stored in said data repository. 

10. The method according to claim 1 , wherein said meta data is 
stored in said data repository. 

11. The method according to claim 1, including the step of 
loading said data from said data source into a staging area. 

12. The method according to claim 1 1, further including the step 
of triggering said building step. 

13. The method according to claim 1, wherein said rules database 
includes one or more attributes for each rule selected from the group consisting of: 

rule type, 
rule name, 

a text description of the rule, 
rule syntax, 
invocation of said rule, 

reporting of erroneous data to said enterprise-level model, 

name of a stored procedure for checking said rule, 

rule precedence, 

a target table identifier, 

a t arget column name , 
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activation status of said rule, 

status information of whether or not said rule is required for complete data 
quality and integrity, 
an error identifier, 

5 status information of whether or not said rule is traceable back to said data 

from said transaction systems, and 

a parameter list, if required by said stored procedure. 

14. The method according to claim 1, wherein each rule of said 
10 rules database includes a SQL statement. 

15. A system for ensuring data quality and integrity of a data set 
derived from a data source, said system including the steps of: 

means for obtaining data from said data source; 
15 means for building a data repository using said data from said data source, 

said data repository including a data structure that forms a model of said data from 
said data source, said building means including: 

means for applying business rules from a rules 
database to said data from said data source, said business 
20 rules dependent upon meta data; and 

means for detecting any errors in said data and 
storing data satisfying said business rules in said data 
repository. 

25 16. The system according to claim 15, further including means for 

reporting said detected errors for correction of said errors in said data source. 

17. The system according to claim 15, further including means for 

providing an integrated data set for export from said data repository. 

30 
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1 8. The system according to claim 15, wherein said data source 
includes a plurality of transaction systems. 

19. The system according to claim 18, further including means for 
storing said data from said plurality of transaction systems in a staging area. 

20. The system according to claim 15 or 19, wherein said model is 
an enterprise-level model and said business rules are enterprise level business rules 

21. The system according to claim 15, further including means for 
feeding back said errors to said data source for correction. 

22. The system according to claim 21, further including means for 
correcting at least a portion of data of said data source dependent upon an error fed 
back to said data source. 

23. The system according to claim 15, wherein said applying 
means includes means for invoking procedures stored in said data repository. 

24. The system according to claim 15, wherein said meta data is 
stored in said data repository. 

25. The system according to claim 15, including means for 
loading said data from said data source into a staging area. 

26. The system according to claim 25, further including means for 
triggering said building means. 

27. The system according to claim 15, wherein said rules database 
includes one or more attributes for each rule selected from the group consisting of: 

rule type, 
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rule name, 

a text description of the rule, 
rule syntax, 
invocation of said rule, 
5 reporting of erroneous data to said enterprise-level model, 

name of a stored procedure for checking said rule, 
rule precedence, 
a target table identifier, 
a target column name , 
10 activation status of said rule, 

status information of whether or not said rule is required for complete data 
quality and integrity, 
an error identifier, 

status information of whether or not said rule is traceable back to said data 
15 from said transaction systems, and 

a parameter list, if required by said stored procedure. 

28. The system according to claim 15, wherein each rule of said 
rules database includes a SQL statement. 

20 

29. A computer program product having a computer readable 
medium with a computer program recorded therein for ensuring data quality and 
integrity of a data set derived from a data source, said computer program product 
including: 

25 computer program code means for obtaining data from said data source; 

computer program code means for building a data repository using said 
data from said data source, said data repository including a data structure that forms a 
model of said data from said data source, said computer program code means for 
building including: 



[I:\DAYLIB\trBCC\Tenix\6062 1 3J6062 1 3_speci_02.doc:sdb 



-21- 



computer program code means for applying business 
rules from a rules database to said data from said data 
source, said business rules dependent upon meta data; and 

computer program code means for detecting any 
5 errors in said data and storing data satisfying said business 

rules in said data repository. 

30. The computer program product according to claim 29, further 
including computer program code means for reporting said detected errors for 

10 correction of said errors in said data source. 

3 1 . The computer program product according to claim 29, further 
including computer program code means for providing an integrated data set for 
export from said data repository. 

15 

32. The computer program product according to claim 29, wherein 
said data source includes a plurality of transaction systems. 

33. The computer program product according to claim 32, further 
20 including computer program code means for storing said data from said plurality of 

transaction systems in a staging area. 

34. The computer program product according to claim 29 or 33, 
wherein said model is an enterprise-level model and said business rules are enterprise 

25 level business rules 

35. The computer program product according to claim 29, further 
including computer program code means for feeding back said errors to said data 
source for correction. 

30 
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36. The computer program product according to claim 35, further 
including computer program code means for correcting at least a portion of data of 
said data source dependent upon an error fed back to said data source. 

37. The computer program product according to claim 29, wherein 
said computer program code means for applying includes computer program code 
means for invoking procedures stored in said data repository. 

38. The computer program product according to claim 29, wherein 
said meta data is stored in said data repository. 

39. The computer program product according to claim 29, 
including computer program code means for loading said data from said data source 
into a staging area. 

40. The computer program product according to claim 39, further 
including computer program code means for triggering said computer program code 
means for building. 

41 . The computer program product according to claim 29, wherein 
said rules database includes one or more attributes for each rule selected from the 
group consisting oft 

rule type, 
rule name, 

a text description of the rule, 

rule syntax, 

in vocation of said rule, 

reporting of erroneous data to said enterprise-level model, 
name of a stored procedure for checking said rule, 
rule precedence, 
a target table identifier, 
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a target column name , 
activation status of said rule, 

status information of whether or not said rule is required for complete data 
quality and integrity, 
5 an error identifier, 

status information of whether or not said rule is traceable back to said data 
from said transaction systems, and 

a parameter list, if required by said stored procedure. 

10 42. The computer program product according to claim 29, wherein 

each rule of said rules database includes a SQL statement. 

43. A system for ensuring data quality and integrity of a data set 
derived from a data source, said system including: 

15 a data repository including a relational store, meta data, an error log and 

stored procedures; 

a rules database including enterprise business rules affecting the transfer 
of data from said data source to said data repository; 

a data quality and integrity engine coupled to said rules database for 
20 invoking said stored procedures of said data repository on said data, said data quality 
and integrity engine for detecting errors in said data and for controlling transfer of 
said data into said data repository. 

44. The system according to claim 43, wherein said data source 
25 includes a plurality of transaction systems. 
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45. The system according to claim 44, further including a staging 

area for receiving data from said plurality of transaction systems. 
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ABSTRACT 



DATA QUALITY AND INTEGRITY ENGINE 



A method (300, 400), a system (200), and a computer program product for 
ensuring data quality and integrity of a data set derived a data source (2 1 0) are 
provided. The data source may be one or more data repositories or data warehouses, 
or one or more transaction systems. Data from the datai source (210) may be stored in 
a staging area (242). A data repository (220) is built using the data. The data 
repository (220) includes a data structure that forms a model of the data from the data 
source; (210). The building step involves applying business rules from a rules 
database (252) to the data. The business rules are dependent upon meta data (224). 
The building step further involves detecting any errors (254) in the data and storing 
data satisfying the business rules in the data repository (220). A log (226) of any 
detected errors may be maintained in the data repository (220). 

Fig. 2 
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FIG. 4 
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